wget
Section: User Commands (1)
Updated: 1996 Nov 11
NAME
wget - a utility to retrieve files from the World Wide Web
SYNOPSIS
wget [options] [URL-list]
WARNING
The information in this man page is an extract from the full
documentation of
Wget.
It could very well be out of date. Please refer to the info page for
full, up-to-date documentation. You can view the info documentation
with the Emacs info subsystem or the standalone info program.
DESCRIPTION
Wget
is a utility designed for retrieving binary documents across the Web,
through the use of HTTP (Hypertext Transfer Protocol) and
FTP (File Transfer Protocol), and saving them to disk.
Wget
is non-interactive, which means it can work in the background while
the user is not logged in, unlike most web browsers (thus you may
start the program and log off, letting it do its work). By analysing
server responses, it distinguishes between correctly and incorrectly
retrieved documents, and retries retrieving them as many times as
necessary, or until a user-specified limit is reached. The REST
command is used for restarting transfers on FTP hosts that support
it. Proxy servers are supported to speed up the retrieval and lighten
the network load.
Wget
supports a full-featured recursion mechanism, through which you can
retrieve large parts of the web, creating local copies of remote
directory hierarchies. Of course, maximum level of recursion and other
parameters can be specified. Infinite recursion loops are always
avoided by hashing the retrieved data. All of this works for both
HTTP and FTP.
The retrieval is conveniently traced by printing dots, each dot
representing one kilobyte of received data. Built-in features offer
mechanisms to tune which links you wish to follow (cf. -L, -D and -H).
URL CONVENTIONS
Most of the URL conventions described in RFC1738 are supported. Two
alternative syntaxes are also supported, which means you can use three
forms of address to specify a file:
Normal URL (recommended form):
http://host[:port]/path
http://fly.cc.fer.hr/
ftp://ftp.xemacs.org/pub/xemacs/xemacs-19.14.tar.gz
ftp://username:password@host/dir/file
FTP only (ncftp-like):
hostname:/dir/file
HTTP only (netscape-like):
hostname[:port]/dir/file
You may encode your username and/or password into the URL using the form:
ftp://user:password@host/dir/file
If you do not understand these syntaxes, just use the plain ordinary
syntax with which you would call lynx or netscape. Note
that the alternative forms are deprecated, and may cease being
supported in the future.
OPTIONS
There are quite a few command-line options for
wget.
Note that you do not have to know or to use them unless you wish to
change the default behaviour of the program. For simple operations you
need no options at all. It is also a good idea to put frequently used
command-line options in .wgetrc, where they can be stored in a more
readable form.
This is the complete list of options with descriptions, sorted in
descending order of importance:
-h --help
Print a help screen. You will also get help if you do not supply
command-line arguments.
-V --version
Display the version of wget.
-v --verbose
Verbose output, with all the available data. The default output
consists only of saving updates and error messages. When the output
is stdout, verbose is the default.
-q --quiet
Quiet mode, with no output at all.
-d --debug
Debug output; it works only if wget was compiled with -DDEBUG. Note
that even when the program is compiled with debug output, it is not
printed unless you specify -d.
-i filename --input-file=filename
Read URL-s from filename, in which case no URL-s need to appear on
the command line. If there are URL-s both on the command line and in
an input file, those on the command line will be retrieved first. The
file need not be an HTML document (but no harm if it is) - it is
enough if the URL-s are just listed sequentially.
However, if you specify --force-html, the document will be regarded as
HTML. In that case you may have problems with relative links,
which you can solve either by adding <base href="url"> to the document
or by specifying --base=url on the command-line.
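For example, assuming a file named url-list.txt that contains one URL
per line (the filename is only illustrative):
wget -i url-list.txt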
-o logfile --output-file=logfile
Log all messages to logfile, instead of the default stdout. Verbose
output is the default when logging to a file. If you do not wish it,
use -nv (non-verbose).
-a logfile --append-output=logfile
Append to logfile - the same as -o, but appending to logfile (or
creating a new one if it does not exist) instead of overwriting the
old log file.
-t num --tries=num
Set number of retries to
num.
Specify 0 for infinite retrying.
-f --follow-ftp
Follow FTP links from HTML documents.
-c --continue-ftp
Continue retrieval of FTP documents from where it was left off. If
you specify "wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z", and there
is already a file named ls-lR.Z in the current directory, wget will
continue retrieval from the offset equal to the length of the
existing file. Note that you do not need to specify this option if
you only want wget to continue retrieving where it left off when the
connection is lost - wget does this by default. You need this option
only to continue retrieving a file that is already partially
retrieved, whether saved by other FTP software or left behind by a
killed wget.
-g on/off --glob=on/off
Turn FTP globbing on or off. By default, globbing will be turned on
if the URL contains globbing characters (e.g. an asterisk). Globbing
means you may use the special characters (wildcards) to retrieve more
files from the same directory at once, like "wget
ftp://gnjilux.cc.fer.hr/*.msg". Globbing currently works only with
UNIX FTP servers.
-e command --execute=command
Execute command, as if it were a part of .wgetrc file. A
command invoked this way will take precedence over the same command
in .wgetrc, if there is one.
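For example, to turn off robots.txt processing for a single run
(robots being one of the startup file commands listed below):
wget -e "robots = off" http://fly.cc.fer.hr/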
-N --timestamping
Use so-called time-stamps to determine whether to retrieve a file. If
the last-modification date of the remote file is equal to or older
than that of the local file, and the sizes of the two files are
equal, the remote file will not be retrieved. This option is useful
for weekly mirroring of HTTP or FTP sites, since it will not permit
downloading of the same file twice.
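For example, to re-retrieve a file only if the remote copy is newer
than the local one:
wget -N ftp://ftp.xemacs.org/pub/xemacs/xemacs-19.14.tar.gz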
-F --force-html
When input is read from a file, force it to be HTML. This
enables you to retrieve relative links from existing HTML files
on your local disk, by adding <base href> to HTML, or using
--base.
-B base_href --base=base_href
Use base_href as base reference, as if it were in the file, in
the form <base href="base_href">. Note that the base in the file will
take precedence over the one on the command-line.
-r --recursive
Recursive web-suck. Depending on the protocol of the URL, this can
mean two things. Recursive retrieval of an HTTP URL means that Wget
will download the URL you specify, parse it as an HTML document (if
it is one), and retrieve the files this document refers to, down to a
certain depth (default 5; change it with -l).
Wget
will create a hierarchy of directories locally, corresponding to the
one found on the HTTP server.
This option is ideal for presentations, where slow connections should
be bypassed. The results will be especially good if relative links
were used, since the pages will then work on the new location without
change.
When using this option with an FTP URL, it will retrieve all the
data from the given directory and subdirectories, similar to
HTTP recursive retrieval.
You should be warned that invoking this option may cause grave
overloading of your connection, and your system administrator may
choose not to enable it. The load can be minimized by lowering the
maximal recursion level (see -l) and/or by lowering the number of
retries (see -t).
-m --mirror
Turn on mirroring options. This will set recursion and time-stamping,
combining -r and -N.
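For example, the following is equivalent to specifying -r -N:
wget -m http://fly.cc.fer.hr/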
-l depth --level=depth
Set the recursion depth to the specified level. The default is 5.
After the given recursion level is reached, the sucking will proceed
from the parent. Thus specifying -r -l1 should be equivalent to a
recursion-less retrieval of a single document. Setting the level to
zero makes the recursion depth (theoretically) unlimited. Note that
the number of retrieved documents will increase exponentially with
the depth level.
-H --span-hosts
Enable spanning across hosts when doing recursive retrieving. See
-r and -D. Refer to
FOLLOWING LINKS
for a more detailed description.
-L --relative
Follow only relative links. Useful for retrieving a specific homepage
without any distractions, not even those from the same host. Refer to
FOLLOWING LINKS
for a more detailed description.
-D domain-list --domains=domain-list
Set domains to be accepted and DNS looked-up, where domain-list is a
comma-separated list. Note that it does not turn on -H. This speeds
things up, even if only one host is spanned. Refer to
FOLLOWING LINKS
for a more detailed description.
-A acclist / -R rejlist --accept=acclist / --reject=rejlist
Comma-separated list of extensions to accept/reject. For example, if
you wish to download only GIFs and JPEGs, you will use -A gif,jpg,jpeg.
If you wish to download everything except cumbersome MPEGs and .AU
files, you will use -R mpg,mpeg,au.
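For example, to recursively retrieve only GIF and JPEG images from a
site:
wget -r -A gif,jpg,jpeg http://fly.cc.fer.hr/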
-X list --exclude-directories=list
Comma-separated list of directories to exclude from FTP fetching.
-P prefix --directory-prefix=prefix
Set directory prefix ("." by default) to
prefix. The directory prefix is the directory where all other
files and subdirectories will be saved to.
-p --prefix-files
Set prefixed files. By default, Wget saves each URL to an
appropriately named file (e.g. http://yoyodine.com/sharon.gif will be
written to "sharon.gif"); if a file with that name already exists, it
tries "filename.1", then "filename.2", etc. This option turns that
behaviour off, saving all your files as "received.n", where n is a
number, 1 or greater. This is sometimes handy for managing a large
number of files that you can easily reconstruct. Set file_prefix to
change the "received" prefix, and -P to change the directory.
-T value --timeout=value
Set the read timeout to a specified value. Whenever a read is issued,
the file descriptor is checked for a possible timeout, which could
otherwise leave a pending connection (uninterrupted read). The default
timeout is 900 seconds (fifteen minutes).
-Y on/off --proxy=on/off
Turn the use of proxies on or off. The proxy is on by default if the
appropriate environment variable is defined.
-Q quota[KM] --quota=quota[KM]
Specify the download quota, in bytes (default), kilobytes or
megabytes. This option is more useful in the rc file; see STARTUP
FILE below.
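For example, to stop retrieving after roughly two megabytes have been
downloaded (the input file name is only illustrative):
wget -Q2M -i url-list.txt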
-O filename --output-document=filename
The documents will not be written to the appropriate files, but will
all be appended to a single file with the name specified by this
option. The number of tries will be automatically set to 1. If this
filename is `-', the documents will be written to stdout, and --quiet
will be turned on. Use this option with caution, since it turns off
all the diagnostics Wget can otherwise give about various errors.
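For example, to write a document to stdout and pipe it through a
pager:
wget -O - http://fly.cc.fer.hr/ | less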
-S --server-response
Print the headers sent by the HTTP server and/or responses sent
by the FTP server.
-s --save-headers
Save the headers sent by the HTTP server to the file, before the
actual contents.
--header=additional-header
Define an additional header. You can define more than one additional
header. Do not try to terminate the header with CR or LF.
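For example, to send an extra header with each HTTP request (the
header shown is only illustrative):
wget --header="Accept-Charset: iso-8859-2" http://fly.cc.fer.hr/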
--http-user --http-passwd
Use these two options to set the username and password Wget will send
to HTTP servers. Wget supports only the basic WWW authentication
scheme.
-nc
Do not clobber existing files when saving to a directory hierarchy
within recursive retrieval of several files. This option is extremely
useful when you wish to continue a retrieval where you left off. If
the files are .html or (yuck) .htm, they will be loaded from disk and
parsed as if they had been retrieved from the Web.
-nv
Non-verbose - turn off verbose without being completely quiet (use
-q for that), which means that error messages and basic information
still get printed.
-nd
Do not create a hierarchy of directories when retrieving
recursively. With this option turned on, all files will get
saved to the current directory, without clobbering (if
a name shows up more than once, the filenames will get
extensions .n).
-x
The opposite of -nd -- Force creation of a hierarchy of directories
even if it would not have been done otherwise.
-nh
Disable time-consuming DNS lookup of almost all hosts. Refer to
FOLLOWING LINKS
for a more detailed description.
-nH
Disable host-prefixed directories. By default, http://fly.cc.fer.hr/
will produce a directory named fly.cc.fer.hr in which everything else
will go. This option disables such behaviour.
--no-parent
Do not ascend to the parent directory.
-k --convert-links
Convert the non-relative links to relative ones locally.
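For example, to retrieve a page tree and convert its links for local
browsing:
wget -r -k http://fly.cc.fer.hr/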
FOLLOWING LINKS
Recursive retrieving has a mechanism that allows you to specify which
links
wget
will follow.
Only relative links
When only relative links are followed (option -L), recursive
retrieving will never span hosts; gethostbyname will never get
called, and the process will be very fast, with the minimum strain on
the network. This will suit your needs most of the time, especially
when mirroring the output of *2html converters, which generally
produce only relative links.
Host checking
The drawback of following only relative links is that humans often
mix them with absolute links to the very same host, and the very same
page. In this mode (which is the default), all URL-s that refer to
the same host will be retrieved.
The problem with this option is host and domain aliases. There is no
way for wget to know that regoc.srce.hr and www.srce.hr are the same
host, or that fly.cc.fer.hr is the same as fly.cc.etf.hr. Whenever an
absolute link is encountered, gethostbyname is called to check
whether we are really on the same host. Although the results of
gethostbyname are hashed, so that it will never get called twice for
the same host, it still presents a nuisance, e.g. in large indexes of
different hosts, where each of them has to be looked up. You can use
-nh to prevent such complex checking, in which case wget will just
compare the hostnames. Things will run much faster, but also much
less reliably.
Domain acceptance
With the -D option you may specify the domains that will be followed.
The nice thing about this option is that hosts not in those domains
will not get DNS-looked up. Thus you may specify -Dmit.edu just to
make sure that nothing outside .mit.edu gets looked up. This is very
important and useful. It also means that -D does not imply -H (which
must be explicitly specified). Feel free to use this option, since it
will speed things up greatly, with almost all the reliability of
checking all hosts. Domain acceptance can also be used to limit the
retrieval to particular domains while spanning hosts freely within
those domains, but then you must explicitly specify -H, as in the
example below.
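For example, to span hosts during a recursive retrieval, but only
within the .mit.edu domain (the starting URL is only illustrative):
wget -r -H -Dmit.edu http://web.mit.edu/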
All hosts
When -H is specified without -D, all hosts are spanned. It is wise to
set the recursion level to a small value in such cases. Unrestricted
spanning of this kind is rarely useful.
FTP
The rules for FTP are somewhat specific, since they have to be. To
have FTP links followed from HTML documents, you must specify -f
(follow_ftp). If you do specify it, FTP links will be able to span
hosts even if span_hosts is not set. The relative_only option (-L)
has no effect on FTP. However, domain acceptance (-D) and suffix
rules (-A/-R) still apply.
STARTUP FILE
Wget
supports the use of the initialization file .wgetrc. First a
system-wide init file will be looked for (/usr/local/lib/wgetrc by
default) and loaded. Then the user's file will be searched for in two
places: the file named by the environment variable WGETRC (which is
presumed to hold the full pathname) and $HOME/.wgetrc. Note that the
settings in the user's startup file may override the system settings,
which includes the quota settings (he he).
The syntax of each line of the startup file is simple:
variable = value
Valid values differ from variable to variable. The complete set of
commands is listed below, with the notation after the equals sign
denoting the kind of value the command takes: on/off for on or off
(which can also be 1 or 0), string for any string, and N for a
positive integer. For example, you may specify "use_proxy = off" to
disable the use of proxy servers by default. You may use inf for an
infinite value (the role of 0 on the command line), where
appropriate. The commands are case-insensitive and
underscore-insensitive, thus File__Prefix is the same as fileprefix.
Empty lines, lines consisting of spaces, and lines beginning with '#'
are skipped.
Most of the commands have their equivalent command-line option,
except some more obscure or rarely used ones. A sample init file is
provided in the distribution, named sample.wgetrc.
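For illustration, a minimal .wgetrc might look like this (the values
are only examples):
# Disable proxies and stop after 5 megabytes.
use_proxy = off
quota = 5m
# Retry each URL up to 10 times, with a two-minute read timeout.
num_tries = 10
timeout = 120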
accept/reject = string
Same as -A/-R.
add_hostdir = on/off
Enable/disable host-prefixed directories. -nH disables it.
always_rest = on/off
Enable/disable continuation of the retrieval, the same as -c.
base = string
Set base for relative URL-s, the same as -B.
convert_links = on/off
Convert non-relative links locally. The same as -k.
debug = on/off
Debug mode, same as -d.
dir_mode = N
Set permission modes of created subdirectories (default is 755).
dir_prefix = string
Top of directory tree, the same as -P.
dirstruct = on/off
Turn the creation of directory structure on or off, the same as -x or
-nd, respectively.
domains = string
Same as -D.
file_prefix = string
Set prefix for output files. It works only if -p is set.
Wget
saves all the files retrieved from the net in files received.n, where
n is a number. If a file named received.n exists,
wget
tries with n + 1, and so forth. The file_prefix option changes the
default "received" prefix.
follow_ftp = on/off
Follow FTP links from HTML documents, the same as -f.
force_html = on/off
If set to on, force the input filename to be regarded as an HTML
document, the same as -F.
ftp_proxy = string
Use the string as the FTP proxy, instead of the one specified in the
environment.
glob = on/off
Turn globbing on/off, the same as -g.
header = string
Define an additional header, like --header.
http_passwd = string
Set HTTP password.
http_proxy = string
Use the string as the HTTP proxy, instead of the one specified in the
environment.
http_user = string
Set HTTP user.
input = string
Read the URL-s from the given file, like -i.
kill_longer = on/off
Consider data longer than specified in the content-length header as
invalid (and retry getting it). The default behaviour is to save as
much data as there is, provided the amount is greater than or equal
to the content-length value.
logfile = string
Set logfile, the same as -o.
login = string
Your user name on the remote machine, for FTP. Defaults to
"anonymous".
mirror = on/off
Turn mirroring on/off. The same as -m.
noclobber = on/off
Same as -nc.
no_parent = on/off
Same as --no-parent.
no_proxy = string
Use the string as the comma-separated list of domains to avoid when
using proxies, instead of the one specified in the environment.
num_tries = N
Set number of retries per URL, the same as -t.
output_document = string
Set the output filename, the same as -O.
passwd = string
Your password on the remote machine, for FTP. Defaults to
username@hostname.domainname.
prefix_files = on/off
Set prefixed files, the same as -p.
quiet = on/off
Quiet mode, the same as -q.
quota = quota
Specify the download quota, which is useful to put in
/usr/local/lib/wgetrc. When a download quota is specified, wget will
stop retrieving after the download total exceeds the quota. The quota
can be specified in bytes (default), kbytes ('k' appended) or mbytes
('m' appended). Thus "quota = 5m" will set the quota to 5 mbytes.
Note that the user's startup file overrides system settings.
reclevel = N
Recursion level, the same as -l.
recursive = on/off
Recursive retrieval on/off, the same as -r.
relative_only = on/off
Follow only relative links (the same as -L). Refer to section
FOLLOWING LINKS
for a more detailed description.
robots = on/off
Use (or ignore) the robots.txt file.
server_response = on/off
Choose whether or not to print the HTTP and FTP server
responses, the same as -S.
simple_host_check = on/off
Same as -nh.
span_hosts = on/off
Same as -H.
timeout = N
Set timeout value, the same as -T.
timestamping = on/off
Turn timestamping on/off. The same as -N.
use_proxy = on/off
Turn proxy support on/off. The same as -Y.
verbose = on/off
Turn verbose on/off, the same as -v/-nv.
SIGNALS
Wget
will catch the SIGHUP (hangup signal) and ignore it. If the output
was going to stdout, it will be redirected to a file named wget-log.
This is also convenient when you wish to redirect the output of a
running Wget:
$ wget http://www.ifi.uio.no/~larsi/gnus.tar.gz &
$ kill -HUP %% # to redirect the output
Wget will not try to handle any signals other than
SIGHUP. Thus you may interrupt Wget using ^C or
SIGTERM.
EXAMPLES
Get URL http://fly.cc.fer.hr/:
wget http://fly.cc.fer.hr/
Force non-verbose output:
wget -nv http://fly.cc.fer.hr/
Remove the limit on the number of retries:
wget -t0 http://www.yahoo.com/
Create a mirror image of fly's web (with the same directory structure
the original has), up to six recursion levels, with only one try per
document, saving the verbose output to log file 'log':
wget -r -l6 -t1 -o log http://fly.cc.fer.hr/
Retrieve only from the www.yahoo.com host (depth 50):
wget -r -l50 http://www.yahoo.com/
ENVIRONMENT
http_proxy,
ftp_proxy,
no_proxy,
WGETRC,
HOME
FILES
/usr/local/lib/wgetrc,
$HOME/.wgetrc
UNRESTRICTIONS
Wget
is free; anyone may redistribute copies of
Wget
to anyone under the terms stated in the General Public License, a copy
of which accompanies each copy of
Wget.
SEE ALSO
lynx(1),
ftp(1)
AUTHOR
Hrvoje Niksic <hniksic@srce.hr> is the author of Wget. Thanks to the
beta testers and all the other people who helped with useful
suggestions.